Concrete is the most important material in civil engineering. Concrete compressive strength is a highly nonlinear function of age and ingredients: cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw (unscaled) form.
For each attribute we are given the variable name, variable type, measurement unit and a brief description. Concrete compressive strength is the regression target. The order of this listing corresponds to the order of the columns in the dataset.
Name -- Data Type -- Measurement -- Description
- It is clearly a typical supervised learning task, since we are given labeled training examples (each instance comes with the expected output, i.e., the concrete compressive strength).
- It is also a typical regression task, since we are asked to predict a value. More specifically, it is a multivariate regression problem, since the system will use multiple features to make a prediction.
The next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight on large errors.
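As a quick sketch of how RMSE works (using made-up numbers rather than the concrete data), it can be computed by hand or via scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted strengths (MPa)
y_true = np.array([30.0, 45.0, 60.0])
y_pred = np.array([28.0, 50.0, 55.0])

# RMSE by hand: square root of the mean of squared residuals;
# the squaring is what gives large errors a higher weight
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
# Equivalent via sklearn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse_manual)  # ~4.243
```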
#Import all the necessary modules
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (10,8)
import warnings
warnings.filterwarnings('ignore')
#!jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
from scipy.stats import zscore
from sklearn.decomposition import PCA
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
import xgboost
df = pd.read_csv("concrete_1.csv")
df.head()
#The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values
df.info()
#The describe() method shows a summary of the numerical attributes
df.describe().T
def basic_details(df):
    b = pd.DataFrame()
    b['Missing value'] = df.isnull().sum()
    b['N unique value'] = df.nunique()
    b['dtype'] = df.dtypes
    return b
basic_details(df)
target = 'strength'
X = df.loc[:, df.columns!=target]
y = df.loc[:, df.columns==target]
While splitting the data, we want to ensure that the test set is representative of the various ranges of strength in the whole dataset. Since strength is a continuous numerical attribute, we have to trick scikit-learn into interpreting our continuous y variable as categorical: we create bins over the range of strength values and pass the binned y to the stratify argument of train_test_split.
# Create bin edges spanning the range of the target values.
# Equal-width bins keep each bin populated enough for stratification;
# the right-most edge is dropped so the maximum falls into the last bin.
bins = np.linspace(y[target].min(), y[target].max(), 11)[:-1]
# Assign each y value to its bin
y_binned = np.digitize(y, bins)
# Pass y_binned to the stratify argument,
# and sklearn will handle the rest
#Test train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y_binned,random_state=2)
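As a sanity check of the bin-and-stratify trick (sketched here on synthetic values rather than the concrete data), the per-bin proportions in the train and test splits should closely agree:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
y_demo = rng.uniform(0, 80, size=1000)          # stand-in target values
bins_demo = np.linspace(0, 80, 10)              # equal-width bin edges
y_binned_demo = np.digitize(y_demo, bins_demo)  # bin index per sample
idx_train, idx_test = train_test_split(
    np.arange(len(y_demo)), test_size=0.3,
    stratify=y_binned_demo, random_state=2)
# Per-bin frequencies in each split
train_frac = np.bincount(y_binned_demo[idx_train], minlength=10) / len(idx_train)
test_frac = np.bincount(y_binned_demo[idx_test], minlength=10) / len(idx_test)
print(np.abs(train_frac - test_frac).max())  # close to zero
```

Note that stratification requires every bin to contain at least two samples, which is why the number of bins should stay small relative to the dataset size.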
We have taken a quick glance at the data to get a general understanding of what we are dealing with. Now the goal is to go a little more in depth. First, we make sure the test set is put aside and we explore only the training set. If the training set were very large, we might sample an exploration set to make manipulations easy and fast; in our case the set is quite small, so we can work directly on the full training set.
df_temp = pd.concat([X_train,y_train],axis=1)
df_temp.hist(bins=30, figsize=(20,15))
plt.show()
import itertools
cols = [i for i in df_temp.columns if i != 'strength']
length = len(cols)
cs = ["b","r","g","c","m","k","lime","c"]
fig = plt.figure(figsize=(13,25))
for i,j,k in itertools.zip_longest(cols,range(length),cs):
    plt.subplot(5,2,j+1)
    ax = sns.distplot(df_temp[i],color=k,rug=True)
    ax.set_facecolor("w")
    plt.axvline(df_temp[i].median(),linestyle="dashed",label="median",color="k")
    plt.axvline(df_temp[i].mean(),linestyle="dashed",label="mean",color="b")
    plt.axvline(np.percentile(df_temp[i],25),linestyle="dashed",label="q1",color="r")
    plt.axvline(np.percentile(df_temp[i],75),linestyle="dashed",label="q3",color="g")
    plt.legend(loc="best")
    plt.title(i,color="navy")
    plt.xlabel("")
ax = sns.distplot(df_temp["strength"],color="m",rug=True)
ax.set_facecolor("w")
plt.axvline(df_temp["strength"].median(),linestyle="dashed",label="median",color="k")
plt.axvline(df_temp["strength"].mean(),linestyle="dashed",label="mean",color="b")
plt.axvline(np.percentile(df_temp["strength"],25),linestyle="dashed",label="q1",color="r")
plt.axvline(np.percentile(df_temp["strength"],75),linestyle="dashed",label="q3",color="g")
plt.legend(loc="best")
plt.title("strength",color="navy")
plt.xlabel("")
sns.lineplot(x="age",y="strength",data=df_temp)
We see that the strength of concrete increases with age up to a certain level; after about 100 days there is a sharp decrease in strength, and after 250 days there appears to be another decrease, which could be due to more moisture absorption from the atmosphere.
sns.set_style('darkgrid')
g = sns.FacetGrid(df_temp,hue="age",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'strength',bins=20,alpha=0.7)
df_temp["age"].value_counts().sort_values()
sns.set(context="paper", font="monospace")
# Create a figure instance
fig = plt.figure(1, figsize=(18, 12))
# Create an axes instance
ax = fig.add_subplot(111)
g = sns.boxplot(data=df_temp, ax=ax, color="blue")
g.set_xticklabels(df_temp.columns,rotation=90)
# Add transparency to colors
for patch in g.artists:
    r, gr, b, a = patch.get_facecolor()
    patch.set_facecolor((r, gr, b, .3))
For now, we will impute the outliers: values below the lower IQR bound are replaced with the 1st-percentile value and values above the upper bound with the 95th-percentile value.
def outlier(df,columns):
    for i in columns:
        quartile_1,quartile_3 = np.percentile(df[i],[25,75])
        quartile_f,quartile_l = np.percentile(df[i],[1,95])
        IQR = quartile_3-quartile_1
        lower_bound = quartile_1 - (1.5*IQR)
        upper_bound = quartile_3 + (1.5*IQR)
        print(i,lower_bound,upper_bound,quartile_f,quartile_l)
        df.loc[df[i] < lower_bound, i] = quartile_f
        df.loc[df[i] > upper_bound, i] = quartile_l
outlier(df_temp, ['age','fineagg','water'])
def correlation_matrix(df):
    corrmat = df.corr()
    plt.figure(figsize=(10,8))
    # plot heat map
    g = sns.heatmap(corrmat,annot=True,cmap="RdYlGn")
correlation_matrix(df_temp)
# Let's check for highly correlated variable pairs
cor= df_temp.corr()
cor.loc[:,:] = np.tril(cor,k=-1)
cor=cor.stack()
cor[(cor > 0.55) | (cor< -0.55)]
sns.pairplot(df_temp,diag_kind='kde')
We will analyse these features further to identify the key features for predicting strength.
## check Blast Furnace Slag, Ash and Cement proportion
df_temp[(df_temp['slag']!=0) & (df_temp['ash']!=0) ].head()
df_temp[(df_temp['ash']!=0) & (df_temp['superplastic']==0)]
## Function to train and evaluate models
def train_evaluate_model(X_train,y_train,X_test,y_test):
    regressors = [
        Ridge(alpha=0.5),
        Lasso(alpha=0.1),
        SVR(),
        GradientBoostingRegressor(random_state=2),
        AdaBoostRegressor(random_state=2),
        RandomForestRegressor(random_state=2),
        LinearRegression(),
        xgboost.XGBRegressor(random_state=2),
        DecisionTreeRegressor(random_state=2)
    ]
    # Logging for visual comparison
    log_cols = ["Regressor", "RMSE Train", "RMSE Test", "R2 Train", "R2 Test"]
    log = pd.DataFrame(columns=log_cols)
    for reg in regressors:
        reg.fit(X_train, np.ravel(y_train))
        name = reg.__class__.__name__
        y_pred_train = reg.predict(X_train)
        rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
        r2_train = r2_score(y_train, y_pred_train)
        y_pred_test = reg.predict(X_test)
        rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
        r2_test = r2_score(y_test, y_pred_test)
        log_entry = pd.DataFrame([[name, rmse_train, rmse_test, r2_train, r2_test]], columns=log_cols)
        log = pd.concat([log, log_entry], ignore_index=True)
    log.set_index(["Regressor"], inplace=True)
    return log.sort_values(by=['RMSE Test'])
initial_model_asis = train_evaluate_model(X_train,y_train,X_test,y_test)
initial_model_asis
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=2)
x_poly_train = polynomial_features.fit_transform(X_train)
x_poly_test = polynomial_features.transform(X_test)  # transform only; the transformer was fit on the training set
model_poly = train_evaluate_model(x_poly_train,y_train,x_poly_test,y_test)
model_poly
fs_model = RandomForestRegressor(n_estimators=100,oob_score=True,random_state=42)
fs_model.fit(X_train, y_train)
fs_model.feature_importances_
pd.DataFrame(fs_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(['Imp'],ascending=False).plot(kind='barh',figsize=[7,6])
fs_model_dt = DecisionTreeRegressor(random_state=42)
fs_model_dt.fit(X_train, y_train)
fs_model_dt.feature_importances_
pd.DataFrame(fs_model_dt.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(['Imp'],ascending=False).plot(kind='barh',figsize=[7,6])
A key factor in concrete strength is the water-cement ratio, so let's create a new feature named "water_cement_ratio".
df_temp["water_cement_ratio"] = df_temp["water"]/df_temp["cement"]
Next, we create "coarse_fine_agg_ratio" from the coarse and fine aggregate components.
df_temp["coarse_fine_agg_ratio"] = df_temp["coarseagg"]/df_temp["fineagg"]
With a bit of groundwork around the domain, I also found that the water-binder ratio is another factor that could help in predicting concrete strength. So let's create a new feature, 'water_binder_ratio'.
df_temp["water_binder_ratio"] = df_temp["water"]/(df_temp["cement"] + df_temp["ash"] + df_temp["slag"])
df_temp.head()
sns.pairplot(df_temp,diag_kind='kde')
correlation_matrix(df_temp)
# Drop the base columns now that the ratio features are derived from them
df_temp.drop(['cement','water','coarseagg','fineagg','ash','slag'], axis=1, inplace=True)
df_temp.info()
df_temp_test = X_test.copy()
df_temp_test["water_cement_ratio"] = df_temp_test["water"]/df_temp_test["cement"]
df_temp_test["coarse_fine_agg_ratio"] = df_temp_test["coarseagg"]/df_temp_test["fineagg"]
df_temp_test["water_binder_ratio"] = df_temp_test["water"]/(df_temp_test["cement"] + df_temp_test["ash"] + df_temp_test["slag"])
# Drop the base columns from the test set as well
df_temp_test.drop(['cement','water','coarseagg','fineagg','ash','slag'], axis=1, inplace=True)
target = 'strength'
X_train = df_temp.loc[:, df_temp.columns!=target]
y_train = df_temp.loc[:, df_temp.columns==target]
X_test = df_temp_test.loc[:, df_temp_test.columns!=target]
fe_model1 = train_evaluate_model(X_train,y_train,X_test,y_test)
fe_model1
fs_model = RandomForestRegressor(n_estimators=100,oob_score=True,random_state=42)
fs_model.fit(X_train, y_train)
fs_model.feature_importances_
pd.DataFrame(fs_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(['Imp'],ascending=False).plot(kind='barh',figsize=[7,6])
fs_model_dt = DecisionTreeRegressor(random_state=42)
fs_model_dt.fit(X_train, y_train)
fs_model_dt.feature_importances_
pd.DataFrame(fs_model_dt.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(['Imp'],ascending=False).plot(kind='barh',figsize=[7,6])
from sklearn.preprocessing import PolynomialFeatures
polynomial_features = PolynomialFeatures(degree=2)
x_poly_train = polynomial_features.fit_transform(X_train)
x_poly_test = polynomial_features.transform(X_test)  # transform only; the transformer was fit on the training set
fe_model_poly = train_evaluate_model(x_poly_train,y_train,x_poly_test,y_test)
fe_model_poly
from sklearn.cluster import KMeans
#check the optimal k value
ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(df_temp.drop('strength',axis=1))
    inertias.append(model.inertia_)
plt.figure(figsize=(8,5))
plt.style.use('bmh')
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.show()
from scipy.stats import zscore
# df_std = df_temp.drop(['strength'],axis=1).apply(zscore)
df_std = df_temp.apply(zscore)
kmeans = KMeans(n_clusters=3,random_state=4)
kmeans.fit(df_std)
# Check the number of data in each cluster
labels = kmeans.labels_
counts = np.bincount(labels[labels>=0])
print(counts)
df_std.columns
# Distribution looks fine.
# let us check the centers in each group
centroids = kmeans.cluster_centers_
centroid_df = pd.DataFrame(centroids, columns = list(df_temp))
centroid_df.transpose()
predictions = kmeans.predict(df_std)
predictions
df_std["group"] = predictions
df_std['group'] = df_std['group'].astype('category')
df_std.dtypes
# Visualize the centers
df_std.boxplot(by = 'group', layout=(3,4), figsize=(15, 10))
# Addressing outliers at group level
def replace(group):
    median, std = group.median(), group.std()  # median and standard deviation of the group
    outliers = (group - median).abs() > 2*std  # flag points more than 2 std from the median
    group[outliers] = median
    return group
data_corrected = (df_std.groupby('group').transform(replace))
concat_data = data_corrected.join(pd.DataFrame(df_std['group']))
concat_data.boxplot(by = 'group', layout=(2,4), figsize=(15, 10))
Note: when we replace outliers with the median or mean, the distribution shape changes and the standard deviation becomes tighter, creating new outliers under the same rule. These new outliers lie much closer to the centre than the original ones, so we accept them without further modification.
df_group0=df_std[(df_std['group']==0)]
df_group1=df_std[(df_std['group']==1)]
df_group2=df_std[(df_std['group']==2)]
sns.pairplot(df_group0,diag_kind='kde')
sns.pairplot(df_group1,diag_kind='kde')
sns.pairplot(df_group2,diag_kind='kde')
# water_binder_ratio vs strength
var = 'water_binder_ratio'
with sns.axes_style("white"):
plot = sns.lmplot(var,'strength',data=concat_data,hue='group')
#plot.set(ylim = (-3,3))
# age vs strength
var = 'age'
with sns.axes_style("white"):
plot = sns.lmplot(var,'strength',data=concat_data,hue='group')
# superplastic vs strength
var = 'superplastic'
with sns.axes_style("white"):
plot = sns.lmplot(var,'strength',data=concat_data,hue='group')
# water_cement_ratio vs strength
var = 'water_cement_ratio'
with sns.axes_style("white"):
plot = sns.lmplot(var,'strength',data=concat_data,hue='group')
# coarse_fine_agg_ratio vs strength
var = 'coarse_fine_agg_ratio'
with sns.axes_style("white"):
plot = sns.lmplot(var,'strength',data=concat_data,hue='group')
There doesn't seem to be a clear separation of clusters among the features, so it seems unlikely that splitting the data into clusters would yield better results.
Set the range of hyperparameters to be tuned with RandomizedSearchCV for GradientBoosting. Hyperparameters are parameters that are set before training a model because they cannot be learned by the algorithm; they control the behavior of the training procedure and have a high impact on the performance of the model.
from sklearn.model_selection import RandomizedSearchCV
import pprint
learning_rate = [0.01, 0.5, 0.1,1]
# Number of trees in GradientBoost
n_estimators = [int(x) for x in np.linspace(start = 100 , stop = 1000, num = 4)] # 4 evenly spaced values
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3, 10, num = 4)] # 4 evenly spaced values
max_depth.append(None)
# Fraction of observations used to fit each individual tree
subsample = [0.8,0.9,1]
# Minimum number of samples required to split a node
min_samples_split = [0.5, 1.0, 2, 5, 8, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Create the random grid
rs_param_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'learning_rate':learning_rate,
'subsample':subsample}
pprint.pprint(rs_param_grid)
gbr_clf = GradientBoostingRegressor(random_state=2)
rf_random = RandomizedSearchCV(estimator=gbr_clf, param_distributions=rs_param_grid,
n_iter = 10, scoring='neg_mean_squared_error',
cv = 5, verbose=2, random_state=42, n_jobs=-1,
return_train_score=True)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
best_random = rf_random.best_estimator_
best_random.score(X_test , y_test)
Set the hyperparameters in a range close to the values output by RandomizedSearchCV, to fine-tune using GridSearchCV.
from sklearn.model_selection import GridSearchCV
param_grid = {
'subsample': [0.8,0.85,0.9,0.95,1],
'n_estimators': [800,900,1000,1200],
'min_samples_split': [2,4,6],
'min_samples_leaf': [2],
'max_features': ['auto','sqrt'],
'max_depth': [3,5],
'learning_rate': [0.15, 0.1,0.05,0.01]
}
grid_search = GridSearchCV(estimator = gbr_clf, param_grid = param_grid,
cv = 5, n_jobs = -1, verbose = 2, return_train_score=True,scoring='neg_mean_squared_error')
# Fit the grid search to the data
grid_search.fit(X_train, y_train);
grid_search.best_params_
best_grid = grid_search.best_estimator_
best_score = best_grid.score(X_test, y_test)
best_score
best_grid
Save the model to a file, for later use.
import pickle
filename = 'concrete_model.sav'
pickle.dump(best_grid,open(filename,'wb'))
xgb_model = xgboost.XGBRegressor(learning_rate=0.01,
colsample_bytree = 1,
subsample = 0.8,
objective='reg:squarederror',
n_estimators=1000,
max_depth=3,
gamma=1)
eval_set = [(X_train, y_train), (X_test, y_test)]
%time xgb_model.fit(X_train, y_train, eval_metric='rmse', eval_set=eval_set, verbose=True)
evals_result = xgb_model.evals_result()
learning_rate = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ]
# Number of trees in XGBoost
n_estimators = [int(x) for x in np.linspace(start = 100 , stop = 1000, num = 4)] # 4 evenly spaced values
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(3, 10, num = 6)] # 6 evenly spaced values
# Fraction of observations used to fit each individual tree
subsample = [0.8,0.9,1]
colsample_bytree = [0.8,0.9,1]
gamma = [0.0, 0.1, 0.2 , 0.3, 0.4, 1, 5, 10]
objective=['reg:squarederror']
#reg_alpha = [0.3,0.4,0.5]
# Create the random grid
rs_param_grid_xgb = {'n_estimators': n_estimators,
'max_depth': max_depth,
'subsample': subsample,
'colsample_bytree': colsample_bytree,
'gamma': gamma,
'learning_rate':learning_rate,
'objective':objective
}
pprint.pprint(rs_param_grid_xgb)
rf_random_xgb = RandomizedSearchCV(estimator=xgb_model, param_distributions=rs_param_grid_xgb,
n_iter = 10, scoring='neg_mean_squared_error',
cv = 5, verbose=2, random_state=42, n_jobs=-1,
return_train_score=True)
# Fit the random search model
rf_random_xgb.fit(X_train, y_train)
rf_random_xgb.best_params_
best_random_xgb = rf_random_xgb.best_estimator_
best_random_xgb.score(X_test , y_test)
param_grid_xgb = {
'colsample_bytree': [0.8, 0.9, 1],
'subsample': [0.8,0.9,1],
'n_estimators': [600,800,1000],
'max_depth': [3, 5, 7, 10],
'gamma': [0, 1],
'learning_rate': [0.01, 0.05, 0.1],
#'reg_alpha': [0.3, 0.4, 0.5],
'objective': ['reg:squarederror']
}
grid_search_xgb = GridSearchCV(estimator = xgb_model, param_grid = param_grid_xgb,
cv = 5, n_jobs = -1, verbose = 2, return_train_score=True,scoring='neg_mean_squared_error')
# Fit the grid search to the data
grid_search_xgb.fit(X_train, y_train);
grid_search_xgb.best_params_
best_grid_xgb = grid_search_xgb.best_estimator_
best_score_xgb = best_grid_xgb.score(X_test, y_test)
best_score_xgb
from sklearn.utils import resample
import pickle
filename = 'concrete_model.sav'
load_lr_model=pickle.load(open(filename,'rb'))
df_train = X_train.join(y_train)
df_test = X_test.join(y_test)
df_row_merged = pd.concat([df_train, df_test], ignore_index=True)
values = df_row_merged.values
#values = df_test.values
# configure bootstrap
n_iterations = 1000
n_size = int(len(X_train) * 1)
# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = np.array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = load_lr_model
    model.fit(train[:,:-1], train[:,-1])
    y_test_array = test[:,-1]
    # evaluate model
    predictions = model.predict(test[:,:-1])
    score = model.score(test[:, :-1] , y_test_array)
    stats.append(score)
# plot scores
plt.hist(stats)
plt.show()
# confidence intervals
alpha = 0.95
p = ((1.0-alpha)/2.0) * 100
lower = max(0.0, np.percentile(stats, p))
p = (alpha+((1.0-alpha)/2.0)) * 100
upper = min(1.0, np.percentile(stats, p))
print('%.0f%% confidence interval: %.1f%% to %.1f%%' % (alpha*100, lower*100, upper*100))
Then we created composite features - water_cement_ratio, coarse_fine_agg_ratio and water_binder_ratio.
We observed a lot of correlation between the composite features and the base features from which they are derived, so we dropped the cement, water, coarseagg, fineagg, ash and slag columns.
We also explored the data for clusters, as there appeared to be multiple Gaussians in the independent features, and split the data into 3 clusters using KMeans clustering. However, no distinct clusters were visible with the available features.
Then we trained models using various algorithms and evaluated their performance.
Finally, we tuned the hyperparameters for GradientBoosting and XGBoost using RandomizedSearchCV and GridSearchCV and found that both gave similar performance.